Abstract

Recently, very high-dimensional feature representations,
e.g., Fisher Vector, have achieved excellent performance for
visual recognition and retrieval. However, these lengthy
representations incur extremely heavy computational and
storage costs and even become infeasible in some large-scale
applications. A few existing techniques can transform very
high-dimensional data into binary codes, but they still
require the reduced code length to be relatively long to
maintain acceptable accuracies. To strike a better balance
between computational efficiency and accuracy, in this paper
we propose a novel embedding method called Binary Projection
Bank (BPB), which can effectively reduce very
high-dimensional representations to medium-dimensional
binary codes without sacrificing accuracy. Instead of using
a conventional single linear or bilinear projection, the
proposed method learns a bank of small projections via the
max-margin constraint to optimally preserve the intrinsic
data similarity. We have systematically evaluated the
proposed method on three datasets: Flickr 1M, ILSVR2010 and
UCF101, showing competitive retrieval and recognition
accuracies compared with state-of-the-art approaches, but
with a significantly smaller memory footprint and lower
coding complexity.


1. Introduction 

Recent research shows that very high-dimensional feature
representations, e.g., Fisher Vector (FV) [23, 27, 22] and
VLAD [11], can achieve state-of-the-art performance in
many visual classification, retrieval and recognition tasks.
Although these very high-dimensional representations lead
to better results, with the emergence of massive-scale
datasets, e.g., ImageNet [4] with around 15M images, the
computational and storage costs of such long data have
become very expensive and even infeasible. For instance, if
we represent 15M samples using 51200-dimensional FVs,
the storage requirement is approximately 5.6TB, and about
$7.7 \times 10^{11}$ arithmetic operations are needed to
measure the Euclidean distances for image retrieval on these
data. Considering the trade-off between computational
efficiency and performance, it is desirable to embed the
high-dimensional data into a reduced feature space. However,
traditional dimensionality reduction methods such as PCA [35]
are not suitable for large-scale/high-dimensional cases. The
main reasons are: (1) most dimensionality reduction methods
are based on full-matrix linear projections, which demand
massive computation and memory when reducing
high-dimensional data; (2) the reduced representations are
usually real-valued vectors. When both the dimensionality
and the number of samples are large, real-valued codes
severely limit the efficiency of retrieval and classification
tasks compared with binary codes. Thus, recent binarization
approaches [7, 8, 25, 26, 34, 19, 14, 3, 2, 36, 23, 27, 6, 17]
have been proposed to embed the original data into binary
codes with a reduced dimension. The codes generated by these
methods can be roughly divided into two groups, i.e., short
binary codes and long binary codes.

Figure 1. Comparison of the proposed method (projection bank)
with state-of-the-art ITQ (linear projection) and BPBC (bilinear
projections). (a-1) Retrieval results on the UCF101 [29] action
dataset with around 10K videos. We use 1K videos as the query
set and report the average semantic precision at the top 50
retrieved points. Each video is represented via a 170400-d FV
(Original). Our goal is mainly to compare the results calculated
on binary codes of medium dimension (from 1000 bits to 10000
bits), which is the region shaded in red in the figure. (a-2)
Storage requirements (double precision) of the three different
projections. For ITQ, it is infeasible to store the projections
when the code length exceeds 10000 bits. (a-3) Coding
complexities of the different projections. (b) Illustration of
the three different coding methods.

Hashing short codes: Most hashing-based approaches
designed for fast searching embed relatively low-dimensional
representations such as GIST [21] into short binary codes
(usually under 500 bits) without too much loss of
information. However, recent state-of-the-art representations
usually have more than ten thousand dimensions. These hashing
methods are therefore less effective for embedding very
high-dimensional data, since, according to [23, 27], short
binary codes cannot preserve enough discriminative
information to maintain high performance. Although some
hashing methods can theoretically generate long binary codes
for high-dimensional data, their enormous computational load
and memory usage make them impractical. For instance, one of
the state-of-the-art hashing methods, Iterative Quantization
(ITQ) [7], suffers an unacceptable loss of retrieval accuracy
compared with long codes of over 1000 bits, while its
computational cost becomes extremely high as the number of
bits increases, as shown in Fig. 1(a).

Long binary codes: In contrast to hashing with short codes,
a few methods [6, 36] have been specifically introduced to
convert high-dimensional data to long binary codes. Among
them, one representative binary coding method is Bilinear
Projection-based Binary Codes (BPBC) [6], which learns two
rotation matrices for efficient binary coding. However, to
minimize the loss of accuracy and achieve state-of-the-art
performance, the left of Fig. 1(a) shows that the binary codes
generated by BPBC have to be long enough (i.e., over 10000
bits). Such long binary codes are still not fast enough for
large-scale applications. Thus, how to learn medium-length
binary codes (i.e., between 1000 bits and 10000 bits) while
still maintaining high accuracy becomes a challenging
research topic.

Many existing binary code learning methods are based
on a single linear projection matrix (e.g., random
projection) to map the data from the high-dimensional space
to a reduced space. In both the training and the coding
phase, the storage requirement of this single linear
projection matrix remains a burden. Taking ITQ (with PCA as
its first step) as an example, to reduce the
170400-dimensional FVs to 10000 dimensions, the size of the
single projection matrix is about 12.7GB and the number of
multiplications for coding a new data sample is
$1.7 \times 10^{9}$. Clearly, such memory requirements and
coding complexity are unrealistic for large-scale
applications. To alleviate this weakness, a bilinear
projection method [6] has been proposed to effectively reduce
the complexity of code learning compared with the linear one.
The middle and right of Fig. 1(a) illustrate the memory usage
and the number of multiplications for coding, respectively,
for the linear projection-based ITQ and the bilinear
projection-based BPBC. Although the size of the projection
matrix for the bilinear method is dramatically reduced, the
coding complexity is still relatively high, especially when
the dimensionality of the original data is high. Therefore,
our target is to further reduce the coding complexity and
produce medium-length codes without sacrificing accuracy.
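The figures quoted in this section can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper names are hypothetical, and it assumes double-precision (8-byte) values and binary units (1GB = 2^30 bytes, 1TB = 2^40 bytes), which is the convention under which the quoted numbers come out.

```python
def projection_cost(input_dim, code_len, bytes_per_value=8):
    """Return (storage in GB, multiplications per sample) for a dense
    input_dim x code_len projection matrix (assumed double precision)."""
    n_values = input_dim * code_len
    return n_values * bytes_per_value / 2**30, n_values


def dataset_storage_tb(n_samples, dim, bytes_per_value=8):
    """Storage in TB for n_samples real-valued vectors of length dim."""
    return n_samples * dim * bytes_per_value / 2**40


gb, mults = projection_cost(170_400, 10_000)      # ITQ example from the text
tb = dataset_storage_tb(15_000_000, 51_200)       # 15M FVs of 51200 dimensions
ops = 15_000_000 * 51_200                          # one pass over every stored value
print(f"{gb:.1f} GB, {mults:.1e} multiplications, {tb:.1f} TB, {ops:.1e} operations")
```

The product of sample count and dimensionality (about 7.7e11) is the scale of one linear scan for a single query; the exact operation count depends on how the distance is implemented.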

In this paper, we propose a novel binarization method
for high-dimensional data. The proposed method first
de-aggregates the original very high-dimensional
representations into several groups of short representations
according to their intrinsic data properties along the
dimensions. After that, for each group of short
representations, a small projection is learned via the
max-margin constraint to optimally preserve the data
similarity. We call our method Binary Projection Bank (BPB),
since it produces a bank of small projections instead of a
conventional linear or bilinear projection, as illustrated in
Fig. 1(b). The contributions of this paper include: (1) we
propose a medium-length binary code learning method, which
outperforms state-of-the-art linear and bilinear methods;
(2) in spite of the reduced code length, our method requires
only low and constant memory usage and coding complexity;
(3) a kernelized version (KBPB) is also proposed for better
performance.

2. Related Work 

There are a few works specifically focusing on
high-dimensional data reduction. One popular method is
Product Quantization (PQ) [10]. Prior to PQ, however, a
random rotation is always needed to balance the variance of
high-dimensional data according to [11]. As discussed
before, such a rotation has high computational complexity.
Recently, an efficient high-dimensional reduction method
based on feature merging [5, 9, 16], termed Pseudo-supervised
Kernel Alignment (PKA) [15], has achieved good performance
with cheaper computation. Besides, aiming at large-scale
tasks, some binary reduction techniques for high-dimensional
data have also been introduced. Perronnin et al. [23]
proposed the “α = 0” binarization scheme and compared it with
Locality Sensitive Hashing (LSH) [3] and Spectral Hashing
(SpH) [34] on compressed FVs. Hashing Kernel (HK) [28] is
also used for high-dimensional signature compression in [27].
Most recently, Bilinear Projection-based Binary Codes (BPBC)
[6] were proposed to achieve more efficient binary coding.
However, the experiments in [23, 27, 6] show that these
methods require very long codes to yield acceptable
performance.

Table 1. Storage and coding complexity of different projection
schemes. The sizes of the two matrices in the bilinear projection
are $D_1 \times d_1$ and $D_2 \times d_2$.

3. Binary Projection Bank 
3.1. Notation and Motivation 

We are given $N$ training data in a $D$-dimensional space:
$x_1, \ldots, x_N \in \mathbb{R}^{D \times 1}$. The goal of
this paper is to generate binary codes for each data point
such that the similarity and structure of the original data
are preserved.

Traditional algorithms take $D$-dimensional data $x$ as
input and use a single projection
$P \in \mathbb{R}^{D \times d}$ to form the linear prediction
function $h(x) = \mathrm{sgn}(P^\top x)$. In effect, the
projection matrix $P$ consists of $d$ linear classifiers
(binary output: 0/1) over the original feature space.
However, for realistic high-dimensional data with noise and
redundant dimensions, learning single projections across the
entire high-dimensional feature space is unwise and incurs
very high computational complexity. To tackle this problem,
we split the original high-dimensional feature space into
$d$ subspaces by merging similar dimensions together, in the
spirit of [15]. In this way, $d$ linear classifiers (i.e.,
projection vectors) are learned, since a subspace spanned by
dimensions with similar properties should require only one
linear classifier. Hence, for each subspace, only one small
projection vector is learned in our architecture. Table 1
summarizes the resource requirements of the different
projection schemes.

In this paper, we propose a Binary Projection Bank
(BPB) algorithm. To effectively decompose the original
data space into subspaces, we first employ a K-means
clustering scheme along the dimensions. Specifically, we take
$x^1, \ldots, x^D \in \mathbb{R}^{1 \times N}$, the rows of
the matrix $[x_1, \ldots, x_N] \in \mathbb{R}^{D \times N}$,
as the K-means input and divide them into $d$ clusters, each
consisting of a subset of these rows, as illustrated:

$$
[x_1, \ldots, x_N] =
\begin{bmatrix} x^1 \\ \vdots \\ x^D \end{bmatrix}
\ \xrightarrow{\ K\text{-means}\ }\
\{\text{Cluster}_1, \ldots, \text{Cluster}_d\}
$$

As mentioned in [15], the decomposition via K-means can
preserve the intrinsic data structure while grouping
dimensions with similar properties together. We denote the
number of dimensions in the $p$-th cluster by $m_p$, where
$p = 1, \ldots, d$. The subspace spanned by the dimensions in
the $p$-th cluster is then $\mathbb{R}^{m_p}$. In each
subspace, a linear classifier is learned to generate one bit
for the data. Hence, $D = m_1 + \cdots + m_d$ and $d$ is the
reduced dimension. In the following, we give the detailed
formulations and learning steps of BPB.
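The decomposition step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it uses plain Lloyd iterations written in NumPy, the function name `cluster_dimensions` and its parameters are hypothetical, and the dense distance computation is only suitable for toy sizes.

```python
import numpy as np


def cluster_dimensions(X, d, n_iter=20, seed=0):
    """Group the D rows of X (shape D x N) into d clusters with plain K-means.

    Returns a list of d index arrays; the p-th array holds the m_p dimensions
    that form the p-th subspace (it may be empty if a cluster loses all rows).
    """
    rng = np.random.default_rng(seed)
    D = X.shape[0]
    centers = X[rng.choice(D, size=d, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every center: (D, d).
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for p in range(d):
            members = X[assign == p]
            if len(members) > 0:          # keep the old center if a cluster empties
                centers[p] = members.mean(axis=0)
    return [np.flatnonzero(assign == p) for p in range(d)]


X = np.random.default_rng(1).normal(size=(12, 30))   # toy data: D = 12, N = 30
groups = cluster_dimensions(X, d=3)
```

Each dimension lands in exactly one group, so the group sizes sum to D, matching $D = m_1 + \cdots + m_d$.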

3.2. Formulation of BPB 

To preserve the data similarity, we first construct a
pseudo label $\ell_{ij}$ for each data pair $(x_i, x_j)$
according to their $k$-nearest neighbors in the original data
space as follows:

$$
\ell_{ij} =
\begin{cases}
+1, & \text{if } x_i \in NN_k(x_j) \text{ or } x_j \in NN_k(x_i) \\
-1, & \text{otherwise}
\end{cases}
$$

where $NN_k(x_i)$ is the set of $k$-nearest neighbors of
$x_i$. Besides, we also define $\ell_{ii} = 1$ for
$i = 1, \ldots, N$. In the BPB learning phase, our goal is to
minimize the distances of positive pairs and maximize the
distances of negative pairs in each subspace generated by
K-means clustering.
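The pseudo-label construction can be sketched as below. The function name is hypothetical, samples are stored as rows (N x D, the transpose of the notation in the text), and neighbors are found by brute-force Euclidean distance, which is an assumption made for brevity.

```python
import numpy as np


def pseudo_labels(X, k):
    """X: (N, D) array of samples. Returns an (N, N) matrix of +1/-1 labels."""
    N = X.shape[0]
    sq = (X ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distances
    np.fill_diagonal(dist, np.inf)                      # a point is not its own neighbor
    nn = np.argsort(dist, axis=1)[:, :k]                # k nearest neighbors per row
    is_nn = np.zeros((N, N), dtype=bool)
    is_nn[np.repeat(np.arange(N), k), nn.ravel()] = True
    labels = np.where(is_nn | is_nn.T, 1, -1)           # the "or" in the definition
    np.fill_diagonal(labels, 1)                         # l_ii = 1
    return labels
```

The symmetrization `is_nn | is_nn.T` implements the "or" condition, so the label matrix is symmetric even though the nearest-neighbor relation is not.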

For the $p$-th subspace, we use
$x_{1(p)}, \ldots, x_{N(p)} \in \mathbb{R}^{m_p \times 1}$ to
represent the data in this subspace. We recast the learning
of projections as the learning of linear classifiers via
pairwise labels generated from unlabeled data. By adopting a
linear classifier $f(x) = w^\top x - b$ as in the SVM
framework, positive pairs are expected to lie on the same
side of the hyperplane, while negative pairs are expected to
lie on different sides.

In fact, if we denote $\hat{x} = [x^\top, -1]^\top$, the
classifier becomes $f(\hat{x}) = [w^\top, b]\,\hat{x}$, which
is equivalent to a linear classifier without the bias $b$. In
the following computation we therefore omit $b$, and the
binary code for each data point $x_{(p)}$ in the $p$-th
subspace is obtained as

$$
h_p(x_{(p)}) = \mathrm{sgn}(w_{(p)}^\top x_{(p)}), \qquad (3)
$$

where $w_{(p)}$ is the coefficient vector of the classifier
for the $p$-th subspace. With the above requirement and the
maximum margin criterion for the positive and negative pairs,
we have the following optimization problem:

$$
\min_{w_{(p)}} \frac{1}{2}\|w_{(p)}\|^2, \quad
\text{s.t. } \ell_{ij}\, w_{(p)}^\top x_{i(p)} \cdot w_{(p)}^\top x_{j(p)} \ge 1,
\quad i, j = 1, \ldots, N, \qquad (4)
$$
where $\frac{1}{2}\|w_{(p)}\|^2$ is the margin regularization
term. Note that if $i = j$, the constraint
$w_{(p)}^\top x_{i(p)} \cdot w_{(p)}^\top x_{i(p)} \ge 1$
becomes $|w_{(p)}^\top x_{i(p)}| \ge 1$, which strictly
requires every point to lie outside the margin. Using the
hinge loss, we can rewrite the optimization problem in (4) as
the following objective function:

$$
L(w_{(p)}) = \frac{1}{2}\|w_{(p)}\|^2
+ \lambda \sum_{i,j} \max\bigl(0,\ 1 - \ell_{ij}\, w_{(p)}^\top x_{i(p)} \cdot w_{(p)}^\top x_{j(p)}\bigr), \qquad (5)
$$
where $\lambda$ is the balance parameter that controls the
importance of the two terms. Since we cannot directly obtain
the optimal $w_{(p)}$ for this objective function, a gradient
descent scheme is applied. Let us denote

$$
L_{ij}(w_{(p)}) = \max\bigl(0,\ 1 - \ell_{ij}\, w_{(p)}^\top x_{i(p)} \cdot w_{(p)}^\top x_{j(p)}\bigr), \quad \forall i, j. \qquad (6)
$$

Taking the derivative of $L_{ij}(w_{(p)})$ with respect to
$w_{(p)}$, we obtain the gradient

$$
\nabla L_{ij}(w_{(p)}) =
\begin{cases}
0, & \text{if } 1 - \ell_{ij}\, w_{(p)}^\top x_{i(p)} \cdot w_{(p)}^\top x_{j(p)} < 0 \\
-\ell_{ij}\bigl(x_{i(p)} x_{j(p)}^\top + x_{j(p)} x_{i(p)}^\top\bigr) w_{(p)}, & \text{else}
\end{cases} \qquad (7)
$$

Note that in our implementation, if
$1 - \ell_{ij}\, w_{(p)}^\top x_{i(p)} \cdot w_{(p)}^\top x_{j(p)} = 0$,
we set $w_{(p)} \leftarrow w_{(p)} + \Delta w_{(p)}$, where
$\Delta w_{(p)}$ is a small nonzero random vector. The same
scheme has also been used in [13, 32].

Therefore, we use gradient descent with the following update
rule to optimize $w_{(p)}$:

$$
w_{(p)} \leftarrow w_{(p)} - \gamma \nabla L(w_{(p)})
= w_{(p)} - \gamma\Bigl(w_{(p)} + \lambda \sum_{i,j} \nabla L_{ij}(w_{(p)})\Bigr),
$$

where $\gamma$ is the step length.

We solve the optimization problem in (4) for each of the
$d$ subspaces and concatenate the $d$ binary bits to form the
final binary code. The final binary code for the
high-dimensional data $x_i$ is:

$$
\bigl[\mathrm{sgn}(w_{(1)}^\top x_{i(1)}), \ldots, \mathrm{sgn}(w_{(d)}^\top x_{i(d)})\bigr], \quad i = 1, \ldots, N.
$$
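The concatenation of per-subspace bits can be written compactly. In this sketch the names `bpb_encode`, `groups` and `weights` are hypothetical, samples are stored as rows, and bits are represented as 0/1 with a score of exactly zero mapped to 1 by convention.

```python
import numpy as np


def bpb_encode(X, groups, weights):
    """X: (N, D) samples; groups[p]: indices of the p-th subspace;
    weights[p]: learned vector of length m_p. Returns an (N, d) 0/1 array."""
    bits = [X[:, g] @ w >= 0 for g, w in zip(groups, weights)]
    return np.stack(bits, axis=1).astype(np.uint8)


X_demo = np.array([[1.0, -2.0, 3.0], [-1.0, 2.0, -3.0]])
groups_demo = [np.array([0, 2]), np.array([1])]
weights_demo = [np.array([1.0, 1.0]), np.array([1.0])]
codes_demo = bpb_encode(X_demo, groups_demo, weights_demo)   # [[1, 0], [0, 1]]
```

Because every dimension is used by exactly one small projection, coding a sample costs $m_1 + \cdots + m_d = D$ multiplications regardless of the code length $d$.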


Adaptive gradient descent (AGD): Furthermore, for fast
convergence, we also equip the optimization procedure with
an adaptive step length. We first initialize $\gamma = 1$.
For the $t$-th iteration, if
$L(w_{(p)}^{(t)}) < L(w_{(p)}^{(t-1)})$, we enlarge the step
length, $\gamma \leftarrow 1.2\gamma$, in the next iteration
to accelerate convergence; otherwise, we halve it,
$\gamma \leftarrow 0.5\gamma$. In the experiments, we also
set an upper bound on the number of gradient descent
iterations. Thus, we stop when the number of iterations
reaches this maximum or the difference
$|L(w_{(p)}^{(t)}) - L(w_{(p)}^{(t-1)})|$ is less than a
small threshold.
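A minimal sketch of the per-subspace optimization with an adaptive step length is given below. It follows Eqs. (5) to (7) for the loss and gradient, but two points are assumptions rather than details from the text: a step that increases the loss is discarded before the step length is halved, and pairs whose margin is exactly zero are treated as inactive instead of perturbing $w$. All function and parameter names are hypothetical.

```python
import numpy as np


def bpb_loss_and_grad(w, X, labels, lam):
    """X: (N, m_p) subspace data, labels: (N, N) pseudo labels, w: (m_p,)."""
    s = X @ w                                   # s_i = w^T x_i
    margin = 1.0 - labels * np.outer(s, s)      # 1 - l_ij (w^T x_i)(w^T x_j)
    active = margin > 0                         # pairs with non-zero hinge loss
    loss = 0.5 * w @ w + lam * margin[active].sum()
    # Sum over active pairs of -l_ij (x_i x_j^T + x_j x_i^T) w, as in Eq. (7).
    A = np.where(active, labels, 0.0)
    grad_hinge = -(X.T @ (A @ s) + X.T @ (A.T @ s))
    return loss, w + lam * grad_hinge


def train_subspace(X, labels, lam=1.0, max_iter=70, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    gamma = 1.0                                  # initial step length
    loss, grad = bpb_loss_and_grad(w, X, labels, lam)
    for _ in range(max_iter):
        w_new = w - gamma * grad
        loss_new, grad_new = bpb_loss_and_grad(w_new, X, labels, lam)
        if loss_new < loss:
            improvement = loss - loss_new
            w, loss, grad = w_new, loss_new, grad_new
            gamma *= 1.2                         # loss decreased: enlarge the step
            if improvement < tol:
                break
        else:
            gamma *= 0.5                         # assumption: discard the failed step
    return w, loss
```

Discarding failed steps makes the loss non-increasing across iterations, which is convenient for a sketch; the scheme described in the text only specifies how the step length is adapted.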
Figure 2. Comparison of the average pairwise error on subspaces
and the pairwise error on the original data space with respect to
the reduced code length d on the ILSVR2010 dataset with 64000-d FV.
All results are the means of 50 runs with 70 iterations of AGD.

Average pairwise error of subspaces vs. pairwise error of
uncompressed data: We also analyze the empirical error rate
of pairwise data in the projected space. Suppose
$w_{(p)}^*$ is the solution of minimizing $L(w_{(p)})$ in
Eq. (5) acquired by AGD for the $p$-th subspace,
$p = 1, \ldots, d$. Then the average error rate on pairwise
data over all the subspaces is defined as

$$
E_{avg} = \frac{1}{d} \sum_{p=1}^{d} \frac{1}{N^2}\,
\#\bigl\{(i, j) \mid \mathrm{sgn}(w_{(p)}^{*\top} x_{i(p)})\,\mathrm{sgn}(w_{(p)}^{*\top} x_{j(p)}) \ne \ell_{ij}\bigr\},
$$

where $\#$ represents the cardinality of the set. On the
other hand, we also compute the solution $w^*$ of minimizing
$L(w) = \frac{1}{2}\|w\|^2 + \lambda \sum_{i,j} \max(0,\ 1 - \ell_{ij}\, w^\top x_i \cdot w^\top x_j)$
for the original $D$-dimensional data $\{x_1, \ldots, x_N\}$.
Finally, we compare $E_{avg}$ with the error rate

$$
E_{org} = \frac{1}{N^2}\,
\#\bigl\{(i, j) \mid \mathrm{sgn}(w^{*\top} x_i)\,\mathrm{sgn}(w^{*\top} x_j) \ne \ell_{ij}\bigr\}
$$

in Fig. 2. We observe that $E_{avg}$ is lower than $E_{org}$
at medium code lengths (under 10000 bits), which indicates
that the data distribution in the subspaces has much better
separability than that in the original space. However, when
$d \to D$, the number of dimensions in each subspace shrinks
to a very small value. In this case, the data in the
subspaces are difficult to separate linearly with the
classifiers of the current BPB. Thus, we extend BPB to a
version with non-linear kernels for better performance.
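The two error rates compared above can be computed as in the following sketch. The function names are hypothetical, samples are stored as rows, and the sign of a zero score is taken as +1 by convention.

```python
import numpy as np


def pairwise_error(X, w, labels):
    """Fraction of pairs (i, j) with sgn(w^T x_i) * sgn(w^T x_j) != l_ij."""
    bits = np.where(X @ w >= 0, 1, -1)
    return float(np.mean(np.outer(bits, bits) != labels))


def average_subspace_error(X, groups, weights, labels):
    """E_avg: mean pairwise error over the d subspaces (columns groups[p] of X)."""
    return float(np.mean([pairwise_error(X[:, g], w, labels)
                          for g, w in zip(groups, weights)]))
```

Calling `pairwise_error` on the full data with a single learned vector gives $E_{org}$, while `average_subspace_error` averages the same quantity over the bank of subspace classifiers.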

3.3. Kernel BPB 

In this section, we introduce our algorithm with kernel
functions, i.e., kernel BPB (KBPB), since the kernel method
can, both theoretically and empirically, solve the linear
inseparability problem mentioned above. Although the kernel
method would incur high computational complexity for
high-dimensional data, in our method the kernel function is
only evaluated in small subspaces, which are spanned by the
dimensions in each cluster.

In the $p$-th subspace, suppose the data are mapped to a
Hilbert space by a mapping function $\phi$, and the kernel
function
$\kappa(x_{i(p)}, x_{j(p)}) = \phi(x_{i(p)})^\top \phi(x_{j(p)})$
is the inner product in that Hilbert space. As in Kernelized
Locality-Sensitive Hashing (KLSH) [12] and Kernel-Based
Supervised Hashing (KSH) [18], we uniformly select $n$
samples from the training data (we call them basis samples)
to reduce the coding complexity of the kernel trick from
$O(Nd)$ to $O(nd)$ (the effect of the choice of $n$ is
discussed in the experiments). Then we establish the
prediction function with the kernel $\kappa$ as follows
(without loss of generality, suppose the basis samples are
the first $n$ samples $x_1, \ldots, x_n$ for all the $d$
subspaces):

$$
g_p(x) = \sum_{i=1}^{n} a_i \kappa(x_{i(p)}, x) - b
= \sum_{i=1}^{n} a_i \phi(x_{i(p)})^\top \phi(x) - b, \qquad (9)
$$

where $a_i \in \mathbb{R}$, $i = 1, \ldots, n$, are the
coefficients and $b \in \mathbb{R}$ is the bias. This is in
fact a linear classifier for the data $\phi(x)$ in the
Hilbert space. Binary codes carrying sufficient information
should be zero-centered [18, 12, 7], which requires
$\sum_{j=1}^{N} g_p(x_{j(p)}) = 0$. To satisfy this
condition, we set
$b = \frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{n} a_i \kappa(x_{i(p)}, x_{j(p)})$.
Substituting $b$ into Eq. (9), the prediction function
becomes:

$$
g_p(x) = \sum_{i=1}^{n} a_i \phi(x_{i(p)})^\top \phi(x) - \sum_{i=1}^{n} a_i \mu_i, \qquad (10)
$$

where $\mu_i = \frac{1}{N} \sum_{j=1}^{N} \kappa(x_{i(p)}, x_{j(p)})$,
$i = 1, \ldots, n$. It is easy to observe that
$\sum_{i=1}^{n} a_i \phi(x_{i(p)})$ is the coefficient vector
of the hyperplane for the data $\phi(x)$ in the Hilbert
space. With similar constraints, we have the following
optimization problem for KBPB:

$$
\min_{a} \frac{1}{2} \Bigl\| \sum_{i=1}^{n} a_i \phi(x_{i(p)}) \Bigr\|^2, \quad
\text{s.t. } \ell_{ij}\, g_p(x_{i(p)})\, g_p(x_{j(p)}) \ge 1, \quad i, j = 1, \ldots, n, \qquad (11)
$$

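where $a = (a_1, \ldots, a_n)^\top$. Naturally, the
corresponding objective function is:

$$
\begin{aligned}
L(a) &= \frac{1}{2} \Bigl\| \sum_{i=1}^{n} a_i \phi(x_{i(p)}) \Bigr\|^2
+ \lambda \sum_{i,j} \max\bigl(0,\ 1 - \ell_{ij}\, g_p(x_{i(p)})\, g_p(x_{j(p)})\bigr) \\
&= \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \phi(x_{i(p)})^\top \phi(x_{j(p)})
+ \lambda \sum_{i,j} \max\bigl(0,\ 1 - \ell_{ij}\, g_p(x_{i(p)})\, g_p(x_{j(p)})\bigr),
\end{aligned}
$$

where $\lambda$ is the balance parameter as in BPB. Let us
denote
$L_1(a) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \phi(x_{i(p)})^\top \phi(x_{j(p)})$
and
$L_{ij}(a) = \max\bigl(0,\ 1 - \ell_{ij}\, g_p(x_{i(p)})\, g_p(x_{j(p)})\bigr)$,
$i, j = 1, \ldots, n$. Then their derivatives with respect to
$a$ can be computed as:

$$
\nabla L_1(a) =
\begin{bmatrix}
a_1 \kappa(x_{1(p)}, x_{1(p)}) + \cdots + a_n \kappa(x_{1(p)}, x_{n(p)}) \\
\vdots \\
a_1 \kappa(x_{n(p)}, x_{1(p)}) + \cdots + a_n \kappa(x_{n(p)}, x_{n(p)})
\end{bmatrix}
$$

and

$$
\nabla L_{ij}(a) =
\begin{cases}
0, & \text{if } 1 - \ell_{ij}\,(a^\top k_i - b)(a^\top k_j - b) < 0 \\
-\bigl(\ell_{ij}\,(k_i - \mu)(k_j - \mu)^\top a + \ell_{ij}\,(k_j - \mu)(k_i - \mu)^\top a\bigr), & \text{else}
\end{cases}
$$

where $\mu = [\mu_1, \ldots, \mu_n]^\top$ and
$k_i = [\kappa(x_{1(p)}, x_{i(p)}), \ldots, \kappa(x_{n(p)}, x_{i(p)})]^\top$,
$i = 1, \ldots, n$.

Similar to BPB, if
$1 - \ell_{ij}\, g_p(x_{i(p)})\, g_p(x_{j(p)}) = 0$, we set
$a \leftarrow a + \Delta a$, where $\Delta a$ is a small
nonzero random vector. Therefore, we have the update rule for
KBPB as follows:

$$
a \leftarrow a - \gamma \Bigl(\nabla L_1(a) + \lambda \sum_{i,j} \nabla L_{ij}(a)\Bigr), \qquad (12)
$$

where $\gamma$ is the step length, which is also adaptively
tuned by AGD.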

Finally, having calculated the coefficients of the
kernelized prediction function for all the $d$ subspaces, the
binary codes for the original data $x_i$ can be expressed as:

$$
\bigl[\mathrm{sgn}(g_1(x_{i(1)})), \ldots, \mathrm{sgn}(g_d(x_{i(d)}))\bigr], \quad i = 1, \ldots, N.
$$
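The zero-centered kernel prediction function of Eq. (10) can be sketched for one subspace as follows. The names `rbf_kernel` and `kbpb_scores` are hypothetical, the RBF kernel is used only as an example, samples are stored as rows, and the coefficient vector `a` is assumed to have been learned already.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    """kappa(a, b) = exp(-||a - b||^2 / sigma^2) for all rows of A and B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma ** 2)


def kbpb_scores(X_train, basis, a, sigma, X_query=None):
    """g_p(x) = a^T (k(x) - mu) for each query row, as in Eq. (10)."""
    K_train = rbf_kernel(X_train, basis, sigma)      # (N, n)
    mu = K_train.mean(axis=0)                        # mu_i = (1/N) sum_j kappa(x_i, x_j)
    K_query = K_train if X_query is None else rbf_kernel(X_query, basis, sigma)
    return (K_query - mu) @ a
```

The bit for the subspace is the sign of the returned score. Subtracting `mu` is what makes the scores sum to zero over the training set, which is the zero-centering condition stated before Eq. (10).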


4. Experiments 

4.1. Large-scale image retrieval 

The proposed BPB and KBPB algorithms are first evaluated
on the image similarity search task. Two realistic
large-scale image datasets are used in our experiments:
Flickr 1M and ILSVR2010. For Flickr 1M, we downloaded
close to one million web images from 55 groups on Flickr,
inspired by [30, 27]. For each image in Flickr 1M, we
extract 128-d SIFT features from 16 x 16 patches around
interest points detected by [20]. The ILSVR2010¹ dataset is a
subset of the ImageNet [4] dataset and contains 1.2 million
images from 1000 categories. The publicly available dense
128-d SIFT features [4] are used.

¹http://www.imagenet.org/challenges/LSVRC/2010/index

We represent each image in both datasets using two
high-dimensional representations: Fisher Vector (FV) and
VLAD. For FV, a Gaussian Mixture Model with 250 Gaussians is
fitted to the SIFT features for both datasets, so the
dimension of the final FV for each image is
2 x 250 x 128 = 64000. For VLAD, K-means clustering is used
to cluster the SIFT features into 250 centers and aggregate
them into VLAD vectors of 250 x 128 = 32000 dimensions. These
VLAD vectors are also power- and ℓ2-normalized [24]. For both
datasets, we randomly select 1000 images as queries and
regard the remaining images as the gallery database. For
evaluation, we first report the semantic precision at the top
50 and top 100 retrieved images (according to the
ground-truth) for both Flickr 1M and ILSVR2010, and then
present the precision-recall curves as well. Additionally, we
report the size of the projection storage and the coding time
(the average time used for each data sample) for some
state-of-the-art methods. Our experiments are completed
using Matlab 2014a on a server configured with a 6-core
processor and 64GB of RAM running the Linux OS.
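The evaluation measure for binary codes can be sketched as below. This is an illustration under stated assumptions: "semantic precision" is taken to mean the fraction of retrieved gallery items that share the query's class label, ranking is by Hamming distance, and the function name and argument layout are hypothetical.

```python
import numpy as np


def precision_at_k(query_codes, gallery_codes, query_labels, gallery_labels, k):
    """Mean fraction of the k Hamming-nearest gallery items sharing the query's label.

    Codes are arrays of 0/1 values with shape (num_items, num_bits).
    """
    q = query_codes.astype(np.int32)
    g = gallery_codes.astype(np.int32)
    # Hamming distance = number of positions where the two codes differ.
    dist = q @ (1 - g).T + (1 - q) @ g.T
    top = np.argsort(dist, axis=1, kind="stable")[:, :k]
    hits = gallery_labels[top] == query_labels[:, None]
    return float(hits.mean())
```

For the real-valued baselines the same routine would apply with the Hamming distance replaced by the Euclidean distance.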

Compared methods and settings: In our experiments,
we compare the proposed method with nine coding methods
including four real-valued dimensionality reduction methods:
Principal Component Analysis (PCA), the projection
via Gaussian random rotation (RR), Product Quantization
(PQ) [10] and Pseudo-supervised Kernel Alignment (PKA)
[15], and five binary coding methods: the sign function
binarization, “α = 0” binarization [23], Locality Sensitive
Hashing (LSH) [3], Spectral Hashing (SpH) [34], Bilinear
Projection-based Binary Codes (BPBC) [6] and Circulant
Binary Embedding (CBE) [36]. We use the publicly available
codes of LSH, SpH, PQ, CBE and PCA, and implement RR, PKA
and BPBC ourselves. Additionally, two natural baselines are
included in our experiments: randomly sampling the
dimensions without replacement to form subspaces
(RandST+BPB), and learning multiple bits with ITQ in each
subspace (Kmeans+ITQ). All of the above methods are then
evaluated for compressing FV and VLAD representations into
three different medium-length codes each (8000, 6400 and
4000 bits for FV; 4000, 3200 and 2000 bits for VLAD).
Considering the feasibility of the training phase for all
the methods, in this experiment 150K data samples are
randomly selected from the gallery database of Flickr 1M and
ILSVR2010, respectively, to form the training set. Besides,
we also randomly choose another 50K data samples from each
of the datasets as a cross-validation set for parameter
tuning. Under the same experimental setting, all the
parameters used in the compared methods have been strictly
chosen according to their original papers.

For the proposed BPB/KBPB, the pairwise label of each
data pair is determined by their 100 nearest neighbors. The
balance parameter $\lambda$ for each dataset is selected from one
of the values in the range of [10“^, 10^], which yields the 
best performance on the cross-validation set. The maximum
number of AGD iterations is fixed at 70, which has been
shown to converge well for the objective function. For
KBPB, we adopt n = 1500 as the number of basis samples
for both Flickr 1M and ILSVR2010. We use the polynomial
kernel
$\kappa(x_{i(p)}, x_{j(p)}) = (x_{i(p)}^\top x_{j(p)} + 1)^{\tau}$
and the RBF kernel
$\kappa(x_{i(p)}, x_{j(p)}) = \exp(-\|x_{i(p)} - x_{j(p)}\|^2 / \sigma^2)$
to implement KBPB^P and KBPB^R, respectively. The best value
of $\tau$ for KBPB^P is selected via cross-validation and the
value of $\sigma$ for KBPB^R is determined adaptively based
on the method in [1]. In fact, any kernel function satisfying
Mercer's condition can be used in KBPB. In BPB/KBPB, since
the coding procedure in each subspace is independent, we
implement a parallel computation scheme to speed up
training. Considering the randomness of K-means
clustering, all the results reported for our methods are the
averages of 50 runs.

Results comparison: Table 2 and Table 3 compare the
retrieval results of the different methods at the top 50 and
top 100 retrieved results on the Flickr 1M and ILSVR2010
datasets, respectively. Generally, FV gains slightly better
results than VLAD on both datasets. Meanwhile, the
accuracies on the ILSVR2010 dataset are lower than those
on the Flickr 1M dataset, since there are more categories
and larger intra-class variations in ILSVR2010. It is
noticeable that PQ achieves low precision on Flickr 1M,
while RR+PQ leads to more reasonable results. The
reason is that high-dimensional representations may have
unbalanced variance across dimensions, which hurts
performance. Thus, randomly rotating the high-dimensional
data prior to PQ² is recommended in [11]. Nevertheless,
because the images in ILSVR2010 are textured with a dominant
object, which leads to relatively balanced variance,
the basic PQ can achieve modest results on ILSVR2010.
PCA and PKA achieve remarkable accuracies as real-valued
compression techniques on both datasets, and CBE is
the strongest baseline among the binary coding methods
according to its performance. LSH, SpH and the “α = 0”
scheme obtain similar results on both datasets, and
using the sign function directly on uncompressed FV/VLAD
proves to be the worst binarization method. Additionally,
Kmeans+ITQ (20 bits) achieves slightly better performance
than RandST+BPB, but both are significantly lower
than BPB.

From Table 2 and Table 3, our BPB algorithm consistently
outperforms all the compared methods at every code
length and leads to competitive accuracies with CBE and
the original FV/VLAD. Moreover, KBPB can achieve better
performance than BPB, since the kernel method can
theoretically and empirically solve the problem of linear
inseparability in subspaces with relatively low dimensions
(the average dimension of each subspace is D/d). Thus, KBPB
gives significantly better performance when d is large, i.e.,
on relatively long binary codes. The best performance on
both datasets is achieved by KBPB with the RBF
kernel. In particular, when the code length decreases, the
retrieval accuracies of all compared methods (except
SpH) drop dramatically, whereas the accuracies of our
methods change only slightly, showing the robustness of the
proposed methods for medium-dimensional binary coding.
Currently, we use hard-assignment K-means for our work.
In Fig. 3, we have also evaluated the possibility of using
soft-assignment clustering for our methods. The results

²In [10], PQ can achieve competitive results without random rotation.
However, they focus on relatively low-dimensional SIFT/GIST features
whose variance already tends to be roughly balanced.



Table 2. Retrieval results (semantic precision) comparison on Flickr 1M with 64000-dimensional FV and 32000-dimensional VLAD.

Methods              | Fisher Vector (64000-d)                                 | VLAD (32000-d)
                     | Precision@top 50           | Precision@top 100          | Precision@top 50           | Precision@top 100
                     | 8000 bit 6400 bit 4000 bit | 8000 bit 6400 bit 4000 bit | 4000 bit 3200 bit 2000 bit | 4000 bit 3200 bit 2000 bit
Original             | 0.383    0.383    0.383    | 0.353    0.335    0.335    | 0.370    0.370    0.370    | 0.339    0.339    0.339
PCA                  | 0.371    0.352    0.314    | 0.354    0.316    0.384    | 0.365    0.341    0.302    | 0.337    0.395    0.361
PQ                   | 0.160    0.132    0.121    | 0.154    0.128    0.111    | 0.142    0.114    0.103    | 0.138    0.109    0.090
PKA                  | 0.380    0.374    0.320    | 0.352    0.337    0.301    | 0.376    0.366    0.307    | 0.335    0.317    0.276
RR+PQ                | 0.292    0.279    0.251    | 0.283    0.268    0.234    | 0.291    0.267    0.233    | 0.265    0.249    0.213
Sign                 | 0.381    0.381    0.381    | 0.377    0.377    0.377    | 0.371    0.371    0.371    | 0.362    0.362    0.362
α = 0                | 0.273    0.273    0.273    | 0.262    0.262    0.262    | -        -        -        | -        -        -
LSH                  | 0.319    0.294    0.270    | 0.304    0.289    0.267    | 0.315    0.287    0.250    | 0.287    0.272    0.244
SpH                  | 0.259    0.288    0.301    | 0.244    0.273    0.296    | 0.254    0.267    0.277    | 0.225    0.240    0.265
BPBC                 | 0.343    0.338    0.314    | 0.328    0.303    0.289    | 0.328    0.312    0.294    | 0.311    0.281    0.267
CBE                  | 0.382    0.379    0.377    | 0.356    0.351    0.342    | 0.360    0.351    0.342    | 0.332    0.327    0.321
Kmeans+ITQ (20 bits) | 0.365    0.338    0.313    | 0.338    0.320    0.310    | 0.363    0.350    0.332    | 0.324    0.311    0.299
RandST+BPB           | 0.352    0.335    0.321    | 0.331    0.316    0.303    | 0.351    0.345    0.331    | 0.319    0.305    0.296
BPB                  | 0.385    0.381    0.376    | 0.356    0.349    0.345    | 0.365    0.357    0.350    | 0.338    0.333    0.329
KBPB^P               | 0.391    0.388    0.375    | 0.358    0.353    0.347    | 0.370    0.362    0.355    | 0.342    0.340    0.332
KBPB^R               | 0.398    0.391    0.379    | 0.367    0.359    0.350    | 0.378    0.376    0.363    | 0.349    0.345    0.336

“Original” indicates the uncompressed FV/VLAD. “Sign” refers to directly using the sign function on the original vectors. The “α = 0” [23] scheme is specifically designed for FV, and the dimension of
the reduced codes via “α = 0” is fixed at (128 + 1) x 250 = 32250. KBPB^P indicates KBPB with the polynomial kernel and KBPB^R indicates KBPB with the RBF kernel. The results of BPB and KBPB
are mean accuracies of 50 runs. For Original, PCA, PKA and RR, the Euclidean distance is used to measure the retrieval. For RR+PQ, the asymmetric distance (ASD) [10] is adopted, and the Hamming distance
is used for the rest of the compared methods. Kmeans+ITQ (20 bits) indicates using K-means to split the dimensions into subspaces and then applying ITQ to learn 20-bit codes for each subspace. RandST+BPB
denotes randomly splitting the dimensions into subspaces without replacement and adopting the BPB optimization scheme to learn the codes.


Table 3. Retrieval results comparison (semantic precision) on ILSVR2010 with 64000-dimensional FV and 32000-dimensional VLAD.

Methods              | Fisher Vector (64000-d)                                 | VLAD (32000-d)
                     | Precision@top 50           | Precision@top 100          | Precision@top 50           | Precision@top 100
                     | 8000 bit 6400 bit 4000 bit | 8000 bit 6400 bit 4000 bit | 4000 bit 3200 bit 2000 bit | 4000 bit 3200 bit 2000 bit
Original             | 0.314    0.314    0.314    | 0.375    0.375    0.375    | 0.385    0.386    0.385    | 0.349    0.349    0.349
PCA                  | 0.385    0.377    0.359    | 0.354    0.338    0.325    | U.l /6   0.160    0.343    | 0.339    U.l 16   030T
PQ                   | 0.157    0.153    0.138    | 0.132    0.125    0.110    | 0.154    0.142    0.127    | 0.115    0.106    0.089
PKA                  | 0.214    0.205    0.183    | 0.179    0.168    0.155    | 0.188    0.172    0.157    | 0.152    0.143    0.124
RR+PQ                | 0.182    0.163    0.157    | 0.151    0.137    0.128    | 0.174    0.166    0.151    | 0.132    0.117    0.108
Sign                 | 0.352    0.352    0.352    | 0.331    0.331    0.331    | arm      frm      (rm      | 0.309    0.309    0.309
α = 0                | 0.175    0.175    0.175    | 0.144    0.144    0.144    | -        -        -        | -        -        -
LSH                  | 0.185    0.171    0.160    | 0.162    0.155    0.143    | 0.175    0.159    0.143    | 0.144    0.128    0.117
SpH                  | 0.160    0.169    0.153    | 0.139    0.151    0.123    | 0.136    0.152    0.149    | 0.108    0.127    0.120
BPBC                 | 0.187    0.183    0.176    | 0.165    0.153    0.146    | 0.180    0.173    0.154    | 0.139    0.127    0.120
CBE                  | 0.214    0.207    0.197    | 0.178    0.164    0.155    | 0.184    0.174    0.167    | 0.145    0.138    0.131
Kmeans+ITQ (20 bits) | 0.201    0.197    0.184    | 0.167    0.158    0.150    | 0.172    0.165    0.155    | 0.137    0.131    0.125
RandST+BPB           | 0.202    0.193    0.182    | 0.161    0.149    0.140    | 0.175    0.164    0.155    | 0.138    0.129    0.121
BPB                  | 0.210    0.205    0.197    | 0.175    0.166    0.158    | 0.186    0.177    0.165    | 0.148    0.142    0.134
KBPB^P               | 0.218    0.208    0.201    | 0.177    0.172    0.163    | 0.190    0.181    0.171    | 0.151    0.145    0.137
KBPB^R               | 0.226    0.221    0.208    | 0.181    0.176    0.168    | 0.198    0.190    0.180    | 0.157    0.150    0.143

illustrate that for medium-dimensional codes (i.e., between 1000 bits
and 10000 bits), soft-assignment clustering based BPB and RBF-KBPB can
achieve results competitive with hard-assignment BPB and KBPB. However,
in the extreme condition (i.e., code dim → feature dim), the
soft-clustering based methods can still produce reasonable results
without losing much accuracy, while the current hard-clustering methods
fail, since the feature dimensions can be re-used during soft
clustering. Thus, from Fig. 3, Table 2 and Table 3, we can observe that
our current version of the projection bank can indeed achieve better
performance for medium-dimensional codes compared with other methods.
Besides, Fig. 4 presents the precision-recall curves of all compared
methods on both datasets with 8000 bits for FV and 4000 bits for VLAD.
From all these figures, we can further observe that, for both datasets,
BPB/KBPB outperform other high-dimensional compression methods with
medium-length codes in terms of both retrieval precision and the Area
Under the Curve (AUC).
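
The difference between hard and soft assignment of feature dimensions can be illustrated with a small sketch. Each small projection sees only its own subset of dimensions and emits one bit. The groups and weights below are hand-written placeholders chosen for illustration: they are neither produced by K-means nor learned with the max-margin objective, and the function name `encode` is hypothetical.

```python
def encode(x, groups, weights):
    """One bit per small projection: the sign of the weighted sum over
    that projection's own subset of feature dimensions."""
    bits = []
    for dims, w in zip(groups, weights):
        s = sum(wj * x[j] for j, wj in zip(dims, w))
        bits.append(1 if s >= 0 else 0)
    return bits


x = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5]  # toy 6-d feature vector

# Hard assignment: the groups partition the dimensions, so the number of
# non-empty groups (and hence bits) can never exceed len(x).
hard_groups = [[0, 1], [2, 3], [4, 5]]
hard_weights = [[1.0, 1.0], [1.0, -1.0], [-1.0, 0.5]]

# Soft assignment: a dimension may be re-used by several groups, so the
# number of groups is no longer capped by the feature dimension.
soft_groups = [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5],
               [0, 4, 5], [0, 1, 5], [0, 2, 4], [1, 3, 5]]
soft_weights = [[1.0, 1.0, 1.0] for _ in soft_groups]

hard_code = encode(x, hard_groups, hard_weights)
soft_code = encode(x, soft_groups, soft_weights)
print(hard_code, len(soft_code))
```

With a hard partition, pushing the code length towards the feature dimension leaves roughly one dimension per projection, which is consistent with the failure of hard clustering in the extreme condition described above.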

Complexity and parameter sensitivity analysis: Table 4
illustrates the comparison of the memory usage for projections,
the training time and the coding time on ILSVR2010.
RR+PQ costs the largest memory space and more time
for coding since the full-matrix projection (i.e., RR) is involved.
Compared with RR+PQ, BPBC and CBE need
much lower memory costs and time complexity for training


Figure 3. Comparison with soft-kmeans on BPB (Flickr 1M with 64000-d FV).


Table 4. Comparison of computational cost on different code lengths for 64000-d FV reduction on ILSVR2010 (data stored in double precision).

Measurement               Code length  RR        RR+PQ     BPBC      CBE       BPB       KBPB
Projections storage (MB)  3200         1562.53   1576.37   0.34      0.49      0.49      24.40
                          4000         1953.19   1970.49   0.42      0.49      0.49      30.51
                          6400         3125.04   3152.50   0.65      0.49      0.49      48.83
                          8000         3906.38   3940.38   0.80      0.49      0.49      61.17
Training time (s)         3200         2.21      1102.72   848.21    304.15    314.27    603.52
                          4000         3.30      1563.48   1021.32   340.24    389.56    718.33
                          6400         4.52      1918.52   1336.10   422.44    472.33    802.51
                          8000         8.64      2204.30   1559.08   464.23    505.16    890.24
Coding time (ms)          3200         246.71    599.47    5.46      21.64     0.24      2.41
                          4000         370.24    791.43    13.45     22.32     0.23      6.64
                          6400         534.10    1182.57   30.57     21.03     0.24      19.10
                          8000         721.51    1603.10   80.02     23.15     0.24      55.91

The training and coding times in this table are both averaged over 50 runs. Parallel computation is adopted to speed up our training phase.


and coding. Our BPB is slightly more time-consuming than CBE
in the training phase but is the most efficient one for coding.
Meanwhile, KBPB costs more memory space than BPBC
and BPB but is still more efficient for coding than BPBC.
In addition, Fig. 5 reports the effect on performance of varying
two essential parameters, n and λ. In terms of the number
of basis samples n used in KBPB with the RBF kernel,
when n > 1000, the retrieval accuracy curves become approximately
stable on both datasets with FV and VLAD,
respectively. This indicates that our KBPB can lead to relatively
robust results with n > 1000. For the balance parameter λ,
our methods (both BPB and KBPB) achieve good performance
when λ ∈ (10“^, 1) and λ ∈ (1, 10) on Flickr 1M and ILSVR2010, respectively.


(a) Flickr 1M (FV)   (b) Flickr 1M (VLAD)   (c) ILSVR2010 (FV)   (d) ILSVR2010 (VLAD)

Figure 4. Comparison of precision-recall curves on Flickr 1M and ILSVR2010 datasets with 8000 bits FV and 4000 bits VLAD.


(a) Flickr 1M (FV)   (b) Flickr 1M (FV)   (c) ILSVR2010 (VLAD)   (d) ILSVR2010 (VLAD)

Figure 5. (a) and (c) show the mean of 50 runs of retrieval accuracies of KBPB (with the RBF kernel) vs. parameter n on Flickr 1M and
ILSVR2010. (b) and (d) show the parameter sensitivity analysis of λ on Flickr 1M and ILSVR2010 at 6400 bits and 3200 bits, respectively.
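
The projection-storage figures reported in Table 4 for RR and BPB can be sanity-checked with a short calculation. This is a sketch under two assumptions that the text does not state explicitly: that "MB" means 2^20 bytes, and that hard-assignment BPB stores exactly one double-precision weight per feature dimension (each dimension belonging to a single small projection). The function names are hypothetical.

```python
BYTES_PER_DOUBLE = 8
MB = 2 ** 20  # assumption: the table's "MB" means 2**20 bytes


def full_projection_mb(d, c):
    """Storage of a dense d x c projection matrix, as used by RR."""
    return d * c * BYTES_PER_DOUBLE / MB


def one_weight_per_dim_mb(d):
    """Assumed BPB layout: every feature dimension belongs to exactly one
    small projection, so d weights are stored whatever the code length."""
    return d * BYTES_PER_DOUBLE / MB


d = 64000  # FV dimension used in Table 4
for c in (3200, 4000, 6400, 8000):
    print(c, full_projection_mb(d, c), one_weight_per_dim_mb(d))
```

Under these assumptions the dense matrix needs about 1562.5 MB at 3200 bits and about 3906.25 MB at 8000 bits, close to the RR column, while the per-dimension layout needs about 0.49 MB independent of code length, in line with the BPB column.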

4.2. Large-scale action recognition 

Finally, we evaluate our methods for action recognition
on the UCF101 dataset [29], which contains 13320 videos
from 101 action categories. We strictly follow the 3-split
train/test setting in [29] and report the average accuracies as
the overall results. The 426-dimensional default Dense Trajectory
Features (DTF) [31] are extracted from each video,
and GMM and K-means are used to cluster them into 200
visual words for FV and VLAD, respectively. Thus, the
length of FV is 2 × 200 × 426 = 170400 and the length
of VLAD is 200 × 426 = 85200. For our methods, we fix
n = 500 and λ = 8, which are both selected via cross-validation,
and other parameters are the same as in the previous
retrieval experiments. In this experiment, we apply
the linear SVM* for action recognition. From the relevant
results shown in Table 5, it can be observed that the recognition
accuracies of all methods generally differ less
than the corresponding performance in the retrieval
tasks. The reason is that the supervised
SVM training can compensate for the differences in discriminative
power between methods, whereas the unsupervised retrieval

*According to [27, 6], the hashing kernel [28, 33] yields an unbiased
estimate of the dot product in the original space. Thus, binary codes can
also be directly fed into a linear SVM.


Table 5. Comparison of action recognition performance (%) on the UCF101 dataset.

Methods             Fisher Vector (170400-d)  VLAD (85200-d)
Code length (bit)   17040   11360   8520      8520    5680    4260
Original            80.33   80.33   80.33     77.95   77.95   77.95
PCA                 78.62   78.31   75.4      77.03   76.28   74.1
RR+PQ               77.25   77.67   75.50     75.38   75.21   74.03
PKA                 80.30   78.88   76.54     77.21   77.00   76.4
PQ                  75.90   74.84   74.31     72.85   72.01   70.99
Sign                75.26   75.26   75.26     74.41   74.41   74.41
α = 0               76.78   75.20   74.56     -       -       -
LSH                 74.19   73.02   71.88     72.40   71.11   70.4
SpH                 71.36   73.04   75.28     69.35   72.97   74.83
BPBC                77.21   76.40   75.89     75.91   74.73   73.22
CBE                 80.65   78.23   76.47     77.91   75.34   74.03
BPB                 80.02   79.26   78.30     77.53   76.38   75.52
KBPB^p              80.74   80.35   79.37     78.28   77.31   76.54
KBPB^r              82.18   81.55   80.71     78.69   77.52   76.90


cannot. Our BPB and KBPB not only achieve results competitive
with the original features, but also perform better
than other compression methods on medium-length codes
with FV and VLAD. Moreover, KBPB^r (KBPB with the RBF kernel)
consistently gives the best performance.
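
The descriptor lengths and code lengths used in this experiment imply large storage reductions, which the following short calculation makes explicit. It is a sketch only: it assumes the original FV/VLAD values are stored as 8-byte doubles (the text states double precision only for Table 4) and that binary codes are packed at 8 bits per byte. The helper names are hypothetical.

```python
def descriptor_lengths(num_words, local_dim):
    """FV and VLAD lengths as stated in the text:
    FV = 2 x words x dim, VLAD = words x dim."""
    return 2 * num_words * local_dim, num_words * local_dim


def storage_ratio(dim, bits, bytes_per_value=8):
    """Bytes of the real-valued descriptor divided by bytes of the
    packed binary code (bytes_per_value=8 is an assumption)."""
    return (dim * bytes_per_value) / (bits / 8)


fv_len, vlad_len = descriptor_lengths(200, 426)
print(fv_len, vlad_len)

# Code lengths from Table 5 are 1/10, 1/15 and 1/20 of the dimension.
for bits in (17040, 11360, 8520):
    print(bits, fv_len / bits, storage_ratio(fv_len, bits))
```

Under these assumptions, a 17040-bit code occupies 2130 bytes per video, compared with 1363200 bytes for the uncompressed FV.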

5. Conclusion and Future Work 

In this paper, we have presented a novel binarization
approach called Binary Projection Bank (BPB) for high-dimensional
data, which exploits a group of small projections
via the max-margin constraint to optimally preserve
the intrinsic data similarity. Different from the conventional
linear or bilinear projections, the proposed method
can effectively map very high-dimensional representations
to medium-dimensional binary codes with a low memory
requirement and a more efficient coding procedure. BPB and
the kernelized version KBPB have achieved better results
compared with state-of-the-art methods for image retrieval
and action recognition applications. In the future, we will
focus more on using soft-assignment clustering based
projection bank methods.